They Can Help: Using Crowdsourcing to Improve the Evaluation of Grammatical Error Detection Systems
نویسندگان
چکیده
Despite the rising interest in developing grammatical error detection systems for non-native speakers of English, progress in the field has been hampered by a lack of informative metrics and an inability to directly compare the performance of systems developed by different researchers. In this paper we address these problems by presenting two evaluation methodologies, both based on a novel use of crowdsourcing. 1 Motivation and Contributions One of the fastest growing areas in need of NLP tools is the field of grammatical error detection for learners of English as a Second Language (ESL). According to Guo and Beckett (2007), “over a billion people speak English as their second or foreign language.” This high demand has resulted in many NLP research papers on the topic, a Synthesis Series book (Leacock et al., 2010) and a recurring workshop (Tetreault et al., 2010a), all in the last five years. In this year’s ACL conference, there are four long papers devoted to this topic. Despite the growing interest, two major factors encumber the growth of this subfield. First, the lack of consistent and appropriate score reporting is an issue. Most work reports results in the form of precision and recall as measured against the judgment of a single human rater. This is problematic because most usage errors (such as those in article and preposition usage) are a matter of degree rather than simple rule violations such as number agreement. As a consequence, it is common for two native speakers to have different judgments of usage. Therefore, an appropriate evaluation should take this into account by not only enlisting multiple human judges but also aggregating these judgments in a graded manner. Second, systems are hardly ever compared to each other. In fact, to our knowledge, no two systems developed by different groups have been compared directly within the field primarily because there is no common corpus or shared task—both commonly found in other NLP areas such as machine translation.1 For example, Tetreault and Chodorow (2008), Gamon et al. (2008) and Felice and Pulman (2008) developed preposition error detection systems, but evaluated on three different corpora using different evaluation measures. The goal of this paper is to address the above issues by using crowdsourcing, which has been proven effective for collecting multiple, reliable judgments in other NLP tasks: machine translation (Callison-Burch, 2009; Zaidan and CallisonBurch, 2010), speech recognition (Evanini et al., 2010; Novotney and Callison-Burch, 2010), automated paraphrase generation (Madnani, 2010), anaphora resolution (Chamberlain et al., 2009), word sense disambiguation (Akkaya et al., 2010), lexicon construction for less commonly taught languages (Irvine and Klementiev, 2010), fact mining (Wang and Callison-Burch, 2010) and named entity recognition (Finin et al., 2010) among several others. In particular, we make a significant contribution to the field by showing how to leverage crowdsourcThere has been a recent proposal for a related shared task (Dale and Kilgarriff, 2010) that shows promise. ing to both address the lack of appropriate evaluation metrics and to make system comparison easier. Our solution is general enough for, in the simplest case, intrinsically evaluating a single system on a single dataset and, more realistically, comparing two different systems (from same or different groups). 2 A Case Study: Extraneous Prepositions We consider the problem of detecting an extraneous preposition error, i.e., incorrectly using a preposition where none is licensed. In the sentence “They came to outside”, the preposition to is an extraneous error whereas in the sentence “They arrived to the town” the preposition to is a confusion error (cf. arrived in the town). Most work on automated correction of preposition errors, with the exception of Gamon (2010), addresses preposition confusion errors e.g., (Felice and Pulman, 2008; Tetreault and Chodorow, 2008; Rozovskaya and Roth, 2010b). One reason is that in addition to the standard context-based features used to detect confusion errors, identifying extraneous prepositions also requires actual knowledge of when a preposition can and cannot be used. Despite this lack of attention, extraneous prepositions account for a significant proportion—as much as 18% in essays by advanced English learners (Rozovskaya and Roth, 2010a)—of all preposition usage errors. 2.1 Data and Systems For the experiments in this paper, we chose a proprietary corpus of about 500,000 essays written by ESL students for Test of English as a Foreign Language (TOEFL R ©). Despite being common ESL errors, preposition errors are still infrequent overall, with over 90% of prepositions being used correctly (Leacock et al., 2010; Rozovskaya and Roth, 2010a). Given this fact about error sparsity, we needed an efficient method to extract a good number of error instances (for statistical reliability) from the large essay corpus. We found all trigrams in our essays containing prepositions as the middle word (e.g., marry with her) and then looked up the counts of each trigram and the corresponding bigram with the preposition removed (marry her) in the Google Web1T 5-gram Corpus. If the trigram was unattested or had a count much lower than expected based on the bigram count, then we manually inspected the trigram to see whether it was actually an error. If it was, we extracted a sentence from the large essay corpus containing this erroneous trigram. Once we had extracted 500 sentences containing extraneous preposition error instances, we added 500 sentences containing correct instances of preposition usage. This yielded a corpus of 1000 sentences with a 50% error rate. These sentences, with the target preposition highlighted, were presented to 3 expert annotators who are native English speakers. They were asked to annotate the preposition usage instance as one of the following: extraneous (Error), not extraneous (OK) or too hard to decide (Unknown); the last category was needed for cases where the context was too messy to make a decision about the highlighted preposition. On average, the three experts had an agreement of 0.87 and a kappa of 0.75. For subsequent analysis, we only use the classes Error and OK since Unknown was used extremely rarely and never by all 3 experts for the same sentence. We used two different error detection systems to illustrate our evaluation methodology:2 • LM: A 4-gram language model trained on the Google Web1T 5-gram Corpus with SRILM (Stolcke, 2002). • PERC: An averaged Perceptron (Freund and Schapire, 1999) classifier— as implemented in the Learning by Java toolkit (Rizzolo and Roth, 2007)—trained on 7 million examples and using the same features employed by Tetreault and Chodorow (2008).
منابع مشابه
Grading, no longer an obstacle to learners’ attendance to teacher feedback
Learners are often reported not to be motivated enough to attend to teacher feedback. Teachers also tend to grade learners’ writing samples when providing them with corrective feedback though they know it may divert their attention away from teacher feedback. However, not grading learner writings does not seem to be an option due to both learners’ demands for it and ins...
متن کاملGrammatical Error Correction of English as Foreign Language Learners
This study aimed to discover the insight of error correction by implementing two correction systems on three Iranian university students. The three students were invited to write four in-class essays throughout the semester, in which their verb errors and individual-selected errors were corrected using the Code Correction System and the Individual Correction System. At the end of the study, the...
متن کاملAn approach to fault detection and correction in design of systems using of Turbo codes
We present an approach to design of fault tolerant computing systems. In this paper, a technique is employed that enable the combination of several codes, in order to obtain flexibility in the design of error correcting codes. Code combining techniques are very effective, which one of these codes are turbo codes. The Algorithm-based fault tolerance techniques that to detect errors rely on the c...
متن کاملPerform Three Data Mining Tasks with Crowdsourcing Process
For data mining studies, because of the complexity of doing feature selection process in tasks by hand, we need to send some of labeling to the workers with crowdsourcing activities. The process of outsourcing data mining tasks to users is often handled by software systems without enough knowledge of the age or geography of the users' residence. Uncertainty about the performance of virtual user...
متن کاملPerformance Evaluation of Medical Image Retrieval Systems Based on a Systematic Review of the Current Literature
Background and Aim: Image, as a kind of information vehicle which can convey a large volume of information, is important especially in medicine field. Existence of different attributes of image features and various search algorithms in medical image retrieval systems and lack of an authority to evaluate the quality of retrieval systems, make a systematic review in medical image retrieval system...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2011